ABSTRACT


The  Pilot  Prediction  System  (PPS)  is  a  research  effort  designed  to  provide  Navy  policy  makers  with  improved 
access  to  selection  and  training  data  in  the  aviation  community.  One  of  its  main  features  is  the  ability  to  make 
predictions  about  the  future  success  of  aviation  candidates  in  flight  training.  The  purpose  of  this  report  is  to 
present  in  some  detail  the  statistical  foundations  of  this  feature  of  the  PPS.  We  first  describe  the  rudiments  of 
statistical  decision  theory.  Such  a  theory  allows  us  to  make  the  very  best  decision  possible  when  faced  with  the 
inherent  uncertainty  about  what  is  actually  going  to  take  place  in  the  future.  The  second  pillar  on  which  the  PPS  is 
built  is  the  treatment  of  probability  from  the  Bayesian  perspective.  From  readily  acceptable  axioms  such  as  the 
sum  rule  and  the  product  rule,  the  Bayesian  approach  is  able  to  process  information  in  a  proven  optimal  manner. 
Information  about  an  aviation  candidate  in  the  form  of  scores  on  a  test  battery  is  readily  available  and  we  would 
like  to  make  the  best  classification  of  the  candidate  as  a  success  or  failure  in  flight  training  based  on  this  kind  of 
information.  Specifically,  the  concept  of  a  predictive  probability  density  is  developed  within  the  Bayesian 
formalism.  Once  a  general  prediction  algorithm  has  been  constructed  with  the  help  of  statistical  decision  theory, 
the  details  about  the  necessary  probabilities  are  provided  by  the  Bayesian  approach.  The  prediction  algorithm  is 
one  of  the  core  modules  of  the  PPS.  It  assigns  candidates  as  likely  passes  or  failures  during  some  phase  of  flight 
training  on  the  basis  of  preexisting  data  and  performance  on  a  test  battery.  A  test  battery  might  include  such  things 
as  night  visual  acuity,  cognitive  information  processing,  psychomotor  skills,  and  personality  assessment.  This 
decision  to  predict  a  pass  or  fail  for  a  given  candidate  is  taken  to  minimize  the  average  monetary  loss  over 
whatever  happens  in  the  future.  Two  technical  appendices  are  included  for  the  interested  reader.  The  first  contains 
a  simplified  proof  of  the  Bayesian  predictive  density  that  allows  the  prediction  algorithm  in  the  PPS  to  be  written 
in  its  most  general  form.  The  second  shows  an  analytical  solution  derived  from  the  theory  developed  in  the  first 
appendix  which  justifies  a  practical  approximation  for  predicting  pass  or  fail  during  flight  training  for  a  candidate 
participating  in  a  selection  test  battery. 
INTRODUCTION


The  Pilot  Prediction  System  (PPS)  is  a  research  effort  designed  to  provide  Navy  policy  makers  with  improved 
access  to  selection  and  training  data  in  the  aviation  community.  One  of  its  main  features  is  the  ability  to  make 
predictions  about  the  future  success  of  aviation  candidates  in  flight  training.  Information  about  the  goals  and  initial 
results  of  the  PPS  is  referenced  in  Blower  [1].  The  purpose  of  this  report  is  to  present  in  some  detail  the  statistical 
foundations  behind  the  PPS.  The  theory  detailed  here  is  also  applicable  to  any  selection  system. 

We  first  describe  the  rudiments  of  statistical  decision  theory.  Such  a  theory  allows  us  to  make  the  very  best 
decision  possible  when  faced  with  the  inherent  uncertainty  about  what  is  actually  going  to  take  place  in  the  future. 
Within  this  theory,  uncertainty  about  the  future  state  of  the  world,  or  in  our  particular  case,  the  future  outcome  of 
flight  training  for  a  Navy  or  Marine  Corps  Officer,  is  handled  quantitatively.  The  first  part  of  the  quantitative 
solution  chooses  in  favor  of  a  prediction  that  results  in  the  minimum  cost.  This  minimum  cost  is  formed  by 
averaging  the  costs  over  all  the  possibilities  that  could  take  place.  When  averaging  is  mentioned,  the  notion  of 
probability  must  be  invoked.  Uncertainty  dealt  with  by  probability  theory  is  the  second  part  of  the  quantitative 
solution. 

The  second  pillar  on  which  the  PPS  is  built  is  the  treatment  of  probability  from  the  Bayesian  perspective.  From 
readily  acceptable  axioms  such  as  the  sum  rule  and  the  product  rule,  the  Bayesian  approach  is  able  to  process 
information  in  a  proven  optimal  manner.  Information  about  an  aviation  candidate  in  the  form  of  scores  on  a  test 
battery  is  readily  available  and  we  would  like  to  make  the  best  classification  of  the  candidate  as  a  success  or 
failure  in  flight  training  based  on  this  information.  More  specifically,  we  emphasize  the  concept  of  a  predictive 
probability  density  as  it  is  defined  within  the  Bayesian  formalism. 

Once  a  general  prediction  algorithm  has  been  constructed  with  the  help  of  statistical  decision  theory,  the  details 
about the necessary probabilities are provided by the Bayesian approach. The prediction algorithm is one of the
core  modules  of  the  PPS.  It  assigns  candidates  as  likely  passes  or  failures  during  some  phase  of  flight  training  on 
the  basis  of  preexisting  data  and  performance  on  a  test  battery.  A  test  battery  might  include  such  things  as  night 
visual  acuity,  cognitive  information  processing,  psychomotor  skills,  and  personality  assessment.  This  decision  to 
predict  a  pass  or  fail  for  a  given  candidate  is  taken  to  minimize  the  average  monetary  loss  over  whatever  happens 
in  the  future. 

Many numerical examples are presented in the course of this paper and some care is taken to explain such
concepts  as  a  loss  matrix,  likelihood  ratios,  beta,  and  cut-off  scores.  The  final  example  illustrates  an  important 
approximation  based  on  the  Gaussian,  or  Normal,  curves.  The  case  examined  here  is  especially  simple  since  it  is 
based  on  just  one  composite  score.  A  more  complicated  example  involves  the  multivariate  Normal  when  a  vector 
of  means  and  the  covariance  matrix  are  involved.  This  more  difficult  case  is  treated  in  Blower  [1]  based  on  the 
exposition by Geisser [2] and Press [3].

Two  technical  appendices  are  included  at  the  end  of  the  paper.  Appendix  A  shows  how  the  Bayesian  predictive 
density  is  arrived  at  from  simple  first  principles.  Working  out  the  details  for  this  case  reveals  that  the  predictive 
density  is  merely  an  average  likelihood  for  the  composite  score  obtained  by  the  candidate.  This  average  likelihood 
is  constructed  from  the  posterior  density  for  all  the  parameters  that  determine  the  likelihood.  The  posterior 
probability  distribution  is  where  the  information  (and  uncertainty)  from  all  the  past  cases  is  stored.  In  these  past 
cases,  both  the  data  and  training  outcome  were  known.  We  try  to  leverage  the  information  contained  in  the 
posterior  density  to  predict  a  training  outcome  for  a  new  candidate  when,  now,  only  the  scores  from  the  test  battery 
are  available. 

The final appendix, Appendix B, presents an analytical derivation (well-known in the Bayesian literature) based
on  the  formal  theory  developed  in  Appendix  A.  We  show  here  how  the  Normal  curves  used  in  practice  for  the 
composite  scores  can  be  justified  from  the  Bayesian  standpoint. 
STATISTICAL DECISION THEORY

Statistical  Decision  Theory  (SDT)  answers  the  following  question:  How  do  I  take  the  best  course  of  action 
when  faced  with  an  uncertain  future?  It  turns  out  that  this  is  the  same  question  that  selection  test  batteries  face 
when  trying  to  make  the  best  classification  of  a  candidate  based  on  performance.  The  precepts  of  SDT  form  the 
fundamental basis for all selection test batteries. The PPS, as the core module for a test battery to select Navy and
Marine  Corps  Pilots  and  Flight  Officers,  relies  heavily  upon  SDT.  The  basic  principle  of  SDT  is  easy  to  grasp. 


Take  that  course  of  action  which  results  in  the  minimum  average  loss. 


A good introduction to SDT from a general statistical standpoint is contained in Berger [4]. An exposition of these
principles for psychologists in the context of signal detection theory is well presented in [5]. SDT consists of three
basic elements: 1) states of the world, 2) possible decisions (or actions), and 3) uncertainty about the state of the
world. The standard notation for these concepts is as follows. The parameter theta, \(\theta\), stands for the set of the \(n\)
states of the world,

\[\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}.\]

\(A\) stands for the set of the \(m\) actions that could be taken in conjunction with the states of the world,

\[A = \{a_1, a_2, \ldots, a_m\}.\]

The words actions and decisions are used synonymously in SDT. The notation \(a_1, a_2, \ldots, a_m\) is designed to
eliminate confusion with the notation \(d_1, d_2, \ldots, d_m\), which is reserved for referring to data points. The uncertainty
about the \(j\)th possible state of the world is captured by a probability distribution with the notation \(P(\theta_j)\).

It  turns  out  that  the  main  problem  of  concern  for  the  PPS  is  especially  easy  for  SDT.  There  are  only  two  states 
of  the  world  and  two  possible  actions.  Therefore, 


\[\theta = \{\theta_1, \theta_2\}\]

and

\[A = \{a_1, a_2\}.\]

Here, \(\theta_1\) represents the state of the world when a flight candidate passes some phase of flight training, and \(\theta_2\)
represents the state of the world when a flight candidate fails some phase of flight training. For this discussion,
assume that the passing and failing criterion refers to the training that takes place through advanced flight training.
Only two possible decisions will be considered in conjunction with these states of the world. They are \(a_1\),
representing the decision to predict pass, and \(a_2\), representing the decision to predict fail.


A principal construct within SDT is the loss matrix for all possible combinations of \(\theta\) and \(A\). With only two
states of the world and two possible actions, the loss matrix is a 2 x 2 matrix. Each one of the four cells of the
loss matrix represents the cost in dollars for the particular combination of the state of the world and the action of
that cell. The loss is expressed as \(L(\theta_j, a_k)\) for the \(j\)th state of the world and the \(k\)th decision. The 2 x 2 matrix
that the PPS uses resembles the one in Table 1. A more specifically labeled loss matrix is presented as Table 2.


Table 1: A generic loss matrix for the decision problem concerning two states of the world and two possible actions.

                  \(a_1\)                   \(a_2\)
\(\theta_1\)      \(L(\theta_1, a_1)\)      \(L(\theta_1, a_2)\)
\(\theta_2\)      \(L(\theta_2, a_1)\)      \(L(\theta_2, a_2)\)


Table 2: A particular loss matrix for the decision problem concerning the selection of candidates to enter flight
training. Comparing with Table 1, note that the two actions, \(a_1\) and \(a_2\), correspond to Predict Pass and Predict
Fail, while the two states of the world, \(\theta_1\) and \(\theta_2\), correspond to Actual Pass and Actual Fail.

                 Predict Pass      Predict Fail
Actual Pass      0                 \(C_1\)
Actual Fail      \(C_2\)           0


In  the  transition  from  Table  1  to  Table  2,  it  can  be  seen  that  the  two  actions  are  predict  pass  and  predict  fail 
and  the  two  states  of  the  world  are  pass  advanced  flight  training  and  fail  advanced  flight  training.  From  the 
structure  of  the  loss  matrix  we  can  observe  that  there  are  two  ways  to  make  a  correct  decision  as  well  as  two  ways 
to  make  an  incorrect  decision.  A  correct  decision  occurs  when  1)  the  action  predict  pass  is  taken  and  the  state  of 
the  world  is  actual  pass  and  2)  the  action  predict  fail  is  taken  and  the  state  of  the  world  is  actual  fail.  For  these 
two  correct  decisions  a  loss  of  0  is  assigned.  Similarly,  an  incorrect  decision  occurs  when  1)  the  action  predict  fail 
is  taken  and  the  state  of  the  world  is  actual  pass  and  2)  the  action  predict  pass  is  taken  and  the  state  of  the  world 
is actual fail. For these two incorrect decisions losses of \(C_1\) and \(C_2\), respectively, are assigned. Subsequently, we
will  investigate  the  impact  on  the  decisions  taken  when  these  costs  are  varied. 

Average  loss 

The  issue  of  an  average  loss  is  an  important  one  and  we  address  it  here.  In  the  parlance  of  statistics, 
expectation is the same as an average. For example, to average the three numbers 1, 2, and 3, one adds the numbers
and divides by three, yielding an average of 2. Written as a formula, this operation is

\[E(x) = \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1}\]

where \(E(x)\) is the expectation operator for \(x\), \(\bar{x}\) is the sample average, \(N = 3\), and \(x_1 = 1\), \(x_2 = 2\), and \(x_3 = 3\).
The correct formula for the expectation of any discrete function of \(x\) is

\[E[f(x)] = \sum_{j=1}^{n} f(x_j)\, P(x_j). \tag{2}\]

The simple formula of Equation (1) has been generalized to its correct definition as a mathematical expectation of
\(f(x)\). In Equation (1), each \(x\) was weighted equally in finding the average. Equation (2) generalizes to the case
where each function of \(x\) is weighted according to its probability of occurrence. If \(f(x_j) = x_j\), \(n = 3\), and
\(P(x_j) = 1/3\), then Equation (2) describes the same situation and gives the same answer as Equation (1),

\[E[f(x)] = (1 \times 1/3) + (2 \times 1/3) + (3 \times 1/3) = 2.\]
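
As a rough illustration of Equation (2), the sketch below computes a discrete expectation in Python. The function name `expectation` and the unequal weights in the second call are assumptions introduced here for illustration; they do not come from the PPS itself. Exact fractions are used so that the check that the probabilities sum to 1 is not disturbed by rounding.

```python
from fractions import Fraction


def expectation(values, probs, f=lambda x: x):
    """Discrete expectation E[f(x)] = sum_j f(x_j) * P(x_j), as in Equation (2)."""
    if len(values) != len(probs):
        raise ValueError("values and probs must have the same length")
    if sum(probs) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(f(x) * p for x, p in zip(values, probs))


xs = [1, 2, 3]

# Equal weights of 1/3 reproduce the simple average of Equation (1).
uniform = [Fraction(1, 3)] * 3
mean_uniform = expectation(xs, uniform)

# Hypothetical unequal weights pull the expectation toward the likelier value.
skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
mean_skewed = expectation(xs, skewed)
```

With equal weights the result is 2, matching Equation (1); with the hypothetical unequal weights it is 7/4.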


Now  that  we  possess  a  general  formula  for  the  expectation  of  any  function,  we  may  substitute  the  loss, 

\(L(\theta_j, a_k)\), for \(f(x)\) where \(x\) is now the state of the world, \(\theta_j\). By doing this, we are finding an expected loss with
respect to the uncertain states of the world. If we fix \(k\), then we are finding an expected loss for a fixed decision.
There are only two decisions, so \(k = 1\) or \(k = 2\). Thus, we will be finding only two expected losses over the
possible states of the world. The formulas for the expected loss of a predicted pass, \(k = 1\), and the expected loss
of a predicted fail, \(k = 2\), will now be derived.


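\[\text{expected loss of decision } a_k = \sum_{j=1}^{n} L(\theta_j, a_k)\, P(\theta_j). \tag{3}\]

With only \(n = 2\) states of the world,

\begin{align}
\text{expected loss of } a_1 &= L(\theta_1, a_1)\, P(\theta_1) + L(\theta_2, a_1)\, P(\theta_2) \tag{4} \\
\text{expected loss of predicted pass} &= [0 \times P(\text{Pass})] + [C_2 \times P(\text{Fail})] \tag{5} \\
&= C_2 \times P(\text{Fail}) \tag{6} \\
\text{expected loss of } a_2 &= L(\theta_1, a_2)\, P(\theta_1) + L(\theta_2, a_2)\, P(\theta_2) \tag{7} \\
\text{expected loss of predicted fail} &= [C_1 \times P(\text{Pass})] + [0 \times P(\text{Fail})] \tag{8} \\
&= C_1 \times P(\text{Pass}) \tag{9}
\end{align}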
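
Equation (3) can be sketched for a loss matrix of any size. In the Python fragment below, the function name `expected_losses`, the unit costs `C1 = 1` and `C2 = 3`, and the probability `p_pass = 0.6` are all hypothetical choices made only to exercise the formula.

```python
def expected_losses(loss, state_probs):
    """Return the expected loss of each action, following Equation (3).

    loss[j][k] is L(theta_j, a_k); state_probs[j] is P(theta_j).
    """
    n_states = len(state_probs)
    n_actions = len(loss[0])
    return [
        sum(loss[j][k] * state_probs[j] for j in range(n_states))
        for k in range(n_actions)
    ]


# Two states (actual pass, actual fail) and two actions (predict pass, predict fail).
C1, C2 = 1.0, 3.0          # hypothetical costs in arbitrary units
loss_matrix = [
    [0.0, C1],             # actual pass: predict pass costs 0, predict fail costs C1
    [C2, 0.0],             # actual fail: predict pass costs C2, predict fail costs 0
]
p_pass = 0.6               # hypothetical probability of passing
el_pass, el_fail = expected_losses(loss_matrix, [p_pass, 1.0 - p_pass])
```

Here `el_pass` reduces to C2 x P(Fail), as in Equation (6), and `el_fail` reduces to C1 x P(Pass), as in Equation (9).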

We  now  invoke  the  basic  principle  of  SDT:  Choose  that  decision  that  results  in  the  minimum  expected  loss.  The 
decision  rule  itself  is  quite  simple.  If  the  expected  loss  of  the  predicted  pass  is  less  than  or  equal  to  the  expected 
loss  of  the  predicted  fail,  then  predict  pass,  otherwise  predict  fail. 


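\begin{align}
\text{Expected loss of predicted pass} &\le \text{Expected loss of predicted fail} \tag{10} \\
C_2 \times P(\text{Fail}) &\le C_1 \times P(\text{Pass}) \tag{11} \\
\frac{C_2}{C_1} &\le \frac{P(\text{Pass})}{P(\text{Fail})} \tag{12}
\end{align}

If

\[\frac{P(\text{Pass})}{P(\text{Fail})} \ge \frac{C_2}{C_1} \tag{13}\]

Then Predict Pass

Else Predict Fail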
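
The step from Equation (10) to Equation (13) can be checked numerically. The sketch below, with hypothetical function names and hypothetical costs, implements both forms of the rule and compares them on a grid of probabilities. Exact fractions are used so that borderline cases are not decided by floating-point rounding.

```python
from fractions import Fraction


def predict_by_expected_loss(p_pass, c1, c2):
    """Equation (10): predict pass when its expected loss is no larger."""
    el_pass = c2 * (1 - p_pass)    # Equation (6)
    el_fail = c1 * p_pass          # Equation (9)
    return "pass" if el_pass <= el_fail else "fail"


def predict_by_odds(p_pass, c1, c2):
    """Equation (13): predict pass when the odds of passing reach C2/C1."""
    if p_pass == 1:
        return "pass"              # the odds are unbounded; avoid dividing by zero
    odds = p_pass / (1 - p_pass)
    return "pass" if odds >= Fraction(c2) / Fraction(c1) else "fail"


# Hypothetical costs; probabilities 0.00, 0.01, ..., 1.00 as exact fractions.
c1, c2 = 2, 5
grid = [Fraction(i, 100) for i in range(101)]
rules_agree = all(
    predict_by_expected_loss(p, c1, c2) == predict_by_odds(p, c1, c2) for p in grid
)
```

On this grid the two rules give the same prediction everywhere, which is what the algebra in Equations (10) through (13) implies for positive costs.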

Numerical  Examples 

This  section  presents  some  numerical  examples  of  the  decision  rule  as  given  by  Equation  (13).  We  will  first 
examine  the  case  where  it  is  more  expensive  to  train  eventual  failures  than  to  reject  some  candidates  who  would 
have  succeeded.  To  solve  this  problem,  it  is  first  necessary  to  fill  in  the  loss  matrix  that  reflects  this  particular 
scenario.  Table  3  presents  an  example  of  such  a  situation. 

Table 3: The loss matrix for a PPS decision problem of choosing candidates to enter flight training when it is more
expensive to train eventual failures than to reject some successful candidates.

                               Predict Pass, \(a_1\)                Predict Fail, \(a_2\)
Actual Pass, \(\theta_1\)      \(L(\theta_1, a_1)\) = 0              \(L(\theta_1, a_2)\) = $200,000
Actual Fail, \(\theta_2\)      \(L(\theta_2, a_1)\) = $800,000       \(L(\theta_2, a_2)\) = 0

As  usual,  the  two  correct  decisions  do  not  result  in  any  loss  at  all  so  they  are  assigned  a  cost  of  $0.  The 
incorrect  decision  of  predicting  a  fail  when  the  candidate  would  have  passed  is  assessed  a  cost  of  $200,000.  This 
cost is typically rather difficult to assess because it involves, among other things, the moral cost of denying a
career  path  to  a  qualified  candidate.  However,  in  this  example,  it  is  viewed  as  a  more  serious  incorrect  decision  to 
predict  a  pass  when  the  student  eventually  fails  training.  This  cost  is  somewhat  easier  to  determine  because  it 
essentially  involves  the  known  training  costs  for  a  candidate  through  some  pipeline  and  through  some  stage  of 
training.  In  an  ideal  situation,  these  costs  would  be  determined  through  a  detailed  economic  analysis  conducted  by 
experts  in  training  and  selection.  In  any  case,  the  exact  costs  are  not  necessary,  only  the  ratio  of  the  two  costs.  In 
the  PPS,  the  main  utility  of  the  loss  matrix  may  be  in  allowing  users  to  assess  the  effects  of  different  costs  on  the 
decision thresholds in a “what-if” exercise.

Given  this  specification  of  the  loss  matrix,  we  can  examine  the  prediction  algorithm  of  Equation  (13)  to  see 
that  the  right  hand  side  (rhs)  of  the  algorithm,  the  ratio  of  the  costs,  has  been  determined.  It  is  now  necessary  to 
assign  the  probabilities  of  pass  and  fail.  This  is  where  the  concept  of  the  uncertainty  about  the  states  of  the  world 
enters  the  picture.  Whether  or  not  a  candidate  is  going  to  successfully  complete  training  is  not  known  beforehand. 
This  future  outcome  is,  by  its  very  nature,  uncertain. 

A probability assignment helps capture this uncertainty. A real number (as opposed to a complex number)
between  0  and  1  is  assigned  to  the  probability  of  a  pass  and  then  1  minus  this  number  is  assigned  to  the 
probability  of  a  fail.  A  number  closer  to  1  indicates  more  certainty  about  the  candidate  passing  training,  while, 
conversely,  a  number  closer  to  0  indicates  more  certainty  about  the  candidate  failing  training.  Numbers  in  the 
middle  range  close  to  .50  reflect  a  greater  degree  of  uncertainty  about  the  outcome. 

Subjecting  the  candidate  to  a  selection  test  battery  yields  relevant  information  that  will  result  in  a  better 
assignment  of  a  probability  for  success  than  if  this  information  were  not  available.  The  PPS  will  incorporate  all  the 
information  from  all  the  various  tests  that  comprise  the  test  battery  in  an  attempt  to  make  an  optimal  assignment  to 
the probability of a pass to any given candidate. That is, we hope that this information from the test battery will
drive  the  probability  assignment  closer  to  1  or  0  than  if  we  didn’t  have  this  information.  We  will  have  much  more 
to  say  on  this  issue  in  upcoming  sections.  For  right  now,  we  just  assign  these  probabilities  rather  cavalierly  in  order 
to  proceed  with  the  numerical  examples.  For  this  first  example,  suppose  that  P(Pass)  =  .75  and  P(Fail)  =  .25. 

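Repeating the prediction algorithm

If

\[\frac{P(\text{Pass})}{P(\text{Fail})} \ge \frac{C_2}{C_1}\]

Then Predict Pass

Else Predict Fail

and then substituting the numbers from the above discussion results in

\[\frac{P(\text{Pass})}{P(\text{Fail})} = \frac{.75}{.25} = 3, \qquad \frac{C_2}{C_1} = \frac{\$800{,}000}{\$200{,}000} = 4\]

\(3 \ge 4\) is FALSE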

Therefore,  the  prediction  algorithm  outputs  a  PREDICT  FAIL.  In  this  case,  a  probability  of  passing  equal  to  75% 
is  simply  not  high  enough  to  commit  to  a  decision  to  predict  a  pass.  This  is  due  to  the  high  cost  of  the  wrong 
decision  to  let  a  candidate  into  training  when  he  or  she  fails. 

Returning  to  first  principles  to  calculate  the  expected  loss  for  both  decisions  can  provide  a  check  on  the 
correctness  of  the  prediction  algorithm.  Using  the  abbreviation  EL  for  expected  loss,  and  because  predicting  a  pass 
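is action \(a_1\), we have for the \(n = 2\) states of the world

\begin{align*}
EL(a_1) &= \sum_{j=1}^{n} L(\theta_j, a_1)\, P(\theta_j) \\
&= L(\theta_1, a_1)\, P(\theta_1) + L(\theta_2, a_1)\, P(\theta_2) \\
&= [\$0 \times P(\text{Pass})] + [\$800{,}000 \times P(\text{Fail})] \\
&= [\$0 \times .75] + [\$800{,}000 \times .25] \\
&= \$200{,}000.
\end{align*}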


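The expected loss for predicting fail, action \(a_2\), is likewise,

\begin{align*}
EL(a_2) &= \sum_{j=1}^{n} L(\theta_j, a_2)\, P(\theta_j) \\
&= L(\theta_1, a_2)\, P(\theta_1) + L(\theta_2, a_2)\, P(\theta_2) \\
&= [\$200{,}000 \times P(\text{Pass})] + [\$0 \times P(\text{Fail})] \\
&= [\$200{,}000 \times .75] + [\$0 \times .25] \\
&= \$150{,}000.
\end{align*}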


SDT  says  to  take  that  decision  which  results  in  the  minimum  expected  loss.  Because  the  expected  loss  for  action 
\(a_2\), predict fail, is less than the expected loss for action \(a_1\), predict pass, ($150,000 compared to $200,000), the PPS
algorithm  predicts  fail. 

If  a  candidate’s  probability  of  passing  based  on  the  test  scores  were  raised  to  90%  from  the  75%  of  the 
previous  example,  then 


\[\frac{P(\text{Pass})}{P(\text{Fail})} \ge \frac{C_2}{C_1}\]

\(9 \ge 4\) is TRUE

and  the  candidate  would  be  admitted  into  training  as  a  predicted  pass.  Given  the  particular  losses  assigned  to  the 
incorrect  decisions,  we  can  see  that  the  threshold  probability  of  passing  based  on  test  battery  data  must  be  at  least 
80%.  Anything  lower  results  in  the  PPS  predicting  fail. 
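
The 80% figure follows from Equation (13) once P(Fail) is written as 1 minus P(Pass). A short derivation is given here, using \(p\) as shorthand for P(Pass) (a symbol introduced only for this step) and assuming \(p < 1\) and positive costs:

```latex
\begin{align*}
\frac{p}{1-p} &\ge \frac{C_2}{C_1} \\
C_1\, p &\ge C_2\,(1-p) \\
(C_1 + C_2)\, p &\ge C_2 \\
p &\ge \frac{C_2}{C_1 + C_2}
   = \frac{\$800{,}000}{\$200{,}000 + \$800{,}000} = 0.80
\end{align*}
```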

As  the  ratio  of  losses  for  these  incorrect  decisions  climbs,  it  becomes  even  harder  for  a  candidate  to  get 
accepted into training. As the following numerical example illustrates, an increased ratio of \(C_2\) to \(C_1\) raises the
threshold  for  acceptance  into  training  even  higher.  Table  4  presents  a  scenario  when  training  is  very  expensive,  but 
the  pool  of  qualified  applicants  wishing  to  be  trained  is  large.  In  this  case,  the  monetary  loss  associated  with  the 
incorrect  decision  to  predict  a  pass  when  the  outcome  is  a  failure  in  training  has  been  raised  to  $1,350,000.  The 
monetary  loss  with  the  other  incorrect  decision  to  predict  a  failure  when  the  candidate  would  have  successfully 
completed  training  is  lowered  to  $150,000.  For  this  altered  situation, 

\[\frac{P(\text{Pass})}{P(\text{Fail})} \ge 9;\]

therefore, the probability of passing given performance on the test battery must be 90% or greater for the decision
rule to recommend acceptance into training. Going back to first principles, this means that the average loss for
predicting a pass for any probability of passing less than 90% is greater than the average loss of predicting a fail.

Table 4: The loss matrix for the decision problem of choosing candidates to enter flight training when training is
very expensive, but the pool of qualified applicants is large.

                               Predict Pass, \(a_1\)                  Predict Fail, \(a_2\)
Actual Pass, \(\theta_1\)      \(L(\theta_1, a_1)\) = 0                \(L(\theta_1, a_2)\) = $150,000
Actual Fail, \(\theta_2\)      \(L(\theta_2, a_1)\) = $1,350,000       \(L(\theta_2, a_2)\) = 0

Of  course,  the  change  in  the  decision  criterion  can  work  just  as  well  in  the  opposite  direction.  If  the  losses  for 
the  two  incorrect  decisions  were  judged  to  be  of  equal  value  as  in  Table  5  below,  then  the  ratio  of  costs  would 
change to

\[\frac{C_2}{C_1} = \frac{\$500{,}000}{\$500{,}000} = 1.\]

As soon as the ratio of the probability of passing to the probability of failing,

\[\frac{P(\text{Pass})}{P(\text{Fail})},\]

becomes greater than or equal to 1, a predicted pass results. A probability of passing equal to 50% or greater would be
sufficient to predict a pass in this case.


Table 5: The loss matrix for the decision problem of choosing candidates to enter flight training when it is equally
as costly to reject someone who could have passed as it is to admit someone who fails.

                               Predict Pass, \(a_1\)                Predict Fail, \(a_2\)
Actual Pass, \(\theta_1\)      \(L(\theta_1, a_1)\) = 0              \(L(\theta_1, a_2)\) = $500,000
Actual Fail, \(\theta_2\)      \(L(\theta_2, a_1)\) = $500,000       \(L(\theta_2, a_2)\) = 0

The  threshold  used  by  the  PPS  for  admitting  a  candidate  moves  in  the  expected  direction.  When  training  costs 
are  high  and  the  pool  of  applicants  is  large  so  that  the  replacement  cost  for  rejected  candidates  is  lowered,  better 
scores on the test battery are demanded. The threshold is placed at a higher level; for example, the probability of
passing  based  on  the  test  battery  results  might  have  to  be  higher  than  90%.  On  the  other  hand,  if  for  whatever 
reason  the  cost  of  rejecting  someone  who  later  would  have  passed  is  comparable  to  the  training  costs,  then  the 
threshold  is  lowered  and  the  decision  to  accept  a  candidate  could  be  lowered  to  a  probability  of  passing  based  on 
test  battery  results  of  50%. 
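
The “what-if” use of the loss matrix can be sketched in a few lines of Python. Solving Equation (13) for P(Pass), with P(Fail) = 1 - P(Pass), gives a threshold of C2 / (C1 + C2). The function name `pass_threshold` is an assumption made for illustration; the costs are those of Tables 3, 4, and 5.

```python
def pass_threshold(c1, c2):
    """Smallest P(Pass) for which the odds P(Pass)/P(Fail) reach C2/C1."""
    return c2 / (c1 + c2)


# (C1, C2) in dollars: cost of rejecting an eventual pass, cost of admitting an eventual fail.
scenarios = {
    "Table 3": (200_000, 800_000),
    "Table 4": (150_000, 1_350_000),
    "Table 5": (500_000, 500_000),
}

thresholds = {name: pass_threshold(c1, c2) for name, (c1, c2) in scenarios.items()}
for name, threshold in thresholds.items():
    print(f"{name}: predict pass when P(Pass) >= {threshold:.0%}")
# Table 3: predict pass when P(Pass) >= 80%
# Table 4: predict pass when P(Pass) >= 90%
# Table 5: predict pass when P(Pass) >= 50%
```

Only the ratio of the two costs matters: scaling both by the same factor leaves each threshold unchanged.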

THE  BAYESIAN  APPROACH 

We  now  address  the  second  pillar  of  the  statistical  foundations  of  the  PPS.  Up  to  this  point,  we  have  used  a 
loose notation for P(Pass) and P(Fail). Each of these probabilities is actually conditioned on data from the test
battery,  so  the  notation  must  express  this  fact. 
During the development of the PPS (or any selection test battery), there is a validation phase where data are
collected on \(N\) subjects. No actual selection takes place; all \(N\) candidates participate in the test battery and then
enter training whatever their score. At some point, the validation phase is completed and the system is then brought
into an operational mode where actual selections take place. Assuming this occurs for the \((N+1)\)st candidate, the
correct notation for the probability of pass and probability of fail in the prediction algorithm should be

\[P(\text{Pass} \mid D_{N+1}, D_N) \quad \text{and} \quad P(\text{Fail} \mid D_{N+1}, D_N).\]

Here \(D_1, D_2, \ldots, D_N\) is the notation for the data in the database from the \(N\) candidates for whom both the scores
on the test battery and the training outcome are known. Then, \(D_{N+1}\) is the notation for the data of the candidate
currently being tested and whose training outcome is unknown.

The addition of the solid vertical line to a general probability expression like \(P(A \mid I)\) is the “conditioned upon”
symbol. It says that the assigned probability for the proposition that appears to the left of the symbol, \(A\), is
conditioned on the truth of what appears to the right of the symbol, \(I\). For our current application, this means that
any probability for passing or failing is conditioned upon both the receipt of the \((N+1)\)st candidate’s data from
the  test  battery  and  what  has  happened  to  the  previous  N  candidates.  It  indicates  that  this  probability  of  passing  or 
failing  may  be,  and  most  likely  is,  different  from  any  other  probability  conditioned  on  something  else.  For  us,  the 
most  relevant  comparison  is  when  the  probability  assignment  conditioned  on  the  test  battery  data  is  contrasted  with 
a  probability  assignment  not  based  on  the  selection  system.  In  other  words,  the  question  is  whether  or  not  the 
selection  system  adds  relevant  information  that  raises  or  lowers  the  probability  of  passing. 


The  Bayesian  approach  is  specifically  designed  to  handle  this  situation.  It  uses  the  basic  axioms  from 
probability theory to find an updated probability for the \((N+1)\)st candidate conditioned on the test scores
achieved  by  this  candidate.  One  easy  result  from  the  application  of  the  basic  axioms  of  probability  theory  is 
Bayes’s  Theorem,  written  as 


\[P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)} \tag{14}\]

The proposition \(H\) usually consists of a number of mutually exclusive and exhaustive hypotheses, labeled
\(H_1, H_2, \ldots, H_j, \ldots, H_K\). Bayes’s Theorem in this form is used to find the probability of the \(i\)th hypothesis, \(H_i\), as
conditioned on the data, \(D\). Then Bayes’s Theorem is rewritten as

\[P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{\sum_{j=1}^{K} P(D \mid H_j)\, P(H_j)} \tag{15}\]


where  the  denominator  has  been  expanded  as  the  sum  of  all  the  terms  that  could  appear  in  the  numerator. 
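
Equation (15) is short enough to sketch directly. In the Python fragment below, the function name `posterior` and the three likelihood values are hypothetical and serve only to show how the denominator normalizes the products in the numerator.

```python
def posterior(likelihoods, priors):
    """Equation (15): P(H_i | D) for mutually exclusive, exhaustive hypotheses.

    likelihoods[i] is P(D | H_i); priors[i] is P(H_i).
    """
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)          # the denominator of Equation (15), P(D)
    if evidence == 0:
        raise ValueError("the data have zero probability under every hypothesis")
    return [j / evidence for j in joint]


# Hypothetical example: three hypotheses with equal prior probability.
post = posterior([0.10, 0.30, 0.60], [1 / 3, 1 / 3, 1 / 3])
```

With equal priors the posterior probabilities are simply the likelihoods rescaled to sum to 1, so the hypothesis under which the data are most probable receives the largest posterior probability.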


Our  current  application  presents  us  with  only  two  mutually  exclusive  and  exhaustive  hypotheses 


\[H_1 = \text{PASS} \quad \text{and} \quad H_2 = \text{FAIL}\]

and \(D\), the data, is actually the data from the \((N+1)\)st candidate, \(D_{N+1}\). Equation (15) then assumes the forms
shown below in Equations (16) and (17) for each of the two hypotheses,

\[P(\text{Pass} \mid D_{N+1}) = \frac{P(D_{N+1} \mid \text{Pass})\, P(\text{Pass})}{P(D_{N+1} \mid \text{Pass})\, P(\text{Pass}) + P(D_{N+1} \mid \text{Fail})\, P(\text{Fail})} \tag{16}\]

\[P(\text{Fail} \mid D_{N+1}) = \frac{P(D_{N+1} \mid \text{Fail})\, P(\text{Fail})}{P(D_{N+1} \mid \text{Pass})\, P(\text{Pass}) + P(D_{N+1} \mid \text{Fail})\, P(\text{Fail})} \tag{17}\]

We temporarily suppress explicitly writing out the information from the previous data, \(D_N\), to keep the formulas
manageable. Later, we will reintroduce the symbol to see how the Bayesian formalism handles \(D_N\).
The prediction algorithm up to now has been written as

If

\[\frac{P(\text{Pass})}{P(\text{Fail})} \ge \frac{C_2}{C_1}\]

Then Predict Pass

Else Predict Fail

The next step is to substitute for P(Pass) and P(Fail) the more correct probabilities conditioned on the data
obtained by the candidate that is to be classified. The goal is to form the ratio

\[\frac{P(\text{Pass} \mid D_{N+1})}{P(\text{Fail} \mid D_{N+1})}.\]

From Equations (16) and (17) the respective denominators cancel out, leaving

\[\frac{P(\text{Pass} \mid D_{N+1})}{P(\text{Fail} \mid D_{N+1})} = \frac{P(D_{N+1} \mid \text{Pass})\, P(\text{Pass})}{P(D_{N+1} \mid \text{Fail})\, P(\text{Fail})}. \tag{18}\]

The prediction algorithm now reads

\[\text{If } \frac{P(\text{Pass} \mid D_{N+1})}{P(\text{Fail} \mid D_{N+1})} \ge \frac{C_2}{C_1} \text{ then Predict Pass, otherwise Predict Fail.} \tag{19}\]


Many  of  the  prediction  algorithms  in  the  literature  stem  from  this  derivation.  However,  the  nomenclature  that  is 
employed  can  obscure  this  fact.  One  popular  form  of  a  prediction  algorithm  is  phrased  in  terms  of  a  likelihood 
ratio  and  a  response  threshold.  Our  prediction  algorithm  can  be  expressed  in  these  terms,  and  the  attempt  is  now 
made  to  allow  the  reader  to  translate  from  one  symbology  to  the  other. 

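Substituting Equation (18) into the revised prediction algorithm of Equation (19) yields

$$\frac{P(D_{N+1} \mid \text{Pass}) \times P(\text{Pass})}{P(D_{N+1} \mid \text{Fail}) \times P(\text{Fail})} > \frac{C_2}{C_1} \tag{20}$$

$$\frac{P(D_{N+1} \mid \text{Pass})}{P(D_{N+1} \mid \text{Fail})} > \frac{C_2}{C_1}\,\frac{P(\text{Fail})}{P(\text{Pass})} \tag{21}$$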


The  left-hand  side  (lhs)  of  Equation  (21)  is  in  the  form  of  a  likelihood  ratio  because  it  is  the  ratio  of  the 
probability  density  of  the  data  given  that  it  came  from  the  Pass  group  over  the  probability  density  of  the  data  given 
that  it  came  from  the  Fail  group.  This  likelihood  ratio  is  written  as, 

$$L(x) = \frac{P(D_{N+1} \mid \text{Pass})}{P(D_{N+1} \mid \text{Fail})}.$$

The rhs of Equation (21) is a function of the costs of making correct and incorrect decisions and the prior odds of
failing over passing. Together they make up $\beta$, the response threshold, where

$$\beta = \frac{C_2}{C_1}\,\frac{P(\text{Fail})}{P(\text{Pass})}.$$

Equation (21) therefore represents a decision algorithm in the form of

$$\text{If } L(x) > \beta \text{ then Predict Pass, otherwise Predict Fail.} \tag{22}$$
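The translation between the two vocabularies can be checked numerically. The sketch below is a minimal illustration with hypothetical function names and made-up likelihood values. It implements the posterior-odds rule of Equation (19) and the likelihood-ratio rule of Equation (22) side by side, so that the same inputs can be fed to both.

```python
def predict_from_posterior_odds(like_pass, like_fail, prior_pass, c1, c2):
    """Equation (19): compare the posterior odds with the cost ratio C2/C1."""
    prior_fail = 1.0 - prior_pass
    evidence = like_pass * prior_pass + like_fail * prior_fail
    post_pass = like_pass * prior_pass / evidence
    post_fail = like_fail * prior_fail / evidence
    return "Pass" if post_pass / post_fail > c2 / c1 else "Fail"


def predict_from_likelihood_ratio(like_pass, like_fail, prior_pass, c1, c2):
    """Equation (22): compare the likelihood ratio L(x) with the threshold beta."""
    prior_fail = 1.0 - prior_pass
    ratio = like_pass / like_fail
    beta = (c2 / c1) * prior_fail / prior_pass
    return "Pass" if ratio > beta else "Fail"


# Hypothetical inputs: (like_pass, like_fail, prior_pass, c1, c2).
cases = [
    (0.30, 0.10, 0.75, 1.0, 3.0),
    (0.10, 0.30, 0.75, 1.0, 3.0),
    (0.20, 0.10, 0.50, 1.0, 3.0),
]
for case in cases:
    print(predict_from_posterior_odds(*case), predict_from_likelihood_ratio(*case))
```

Away from exact ties, the two functions return the same prediction, since Equation (21) is only an algebraic rearrangement of Equation (19).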

Numerical Examples


The rhs of the new prediction algorithm, Equation (21), is dealt with first. That is, we calculate the value of $\beta$.
Then we turn to the lhs of Equation (21), the likelihood ratio. Take the costs $C_2$ and $C_1$ to be in the ratio of 3:1 so
that

$$\frac{C_2}{C_1} = 3.$$

For  example,  suppose  that  an  attrition  during  advanced  flight  training  costs  $900,000  and  the  cost  of  replacing  a 
PPS  rejected  candidate  is  $300,000.  P(Pass)  and  P(Fail)  are  probabilities  not  conditioned  upon  any  data  from  a 
selection  test  battery.  We  may  assign  these  probabilities  to  be  reflective  of  the  historical  record  for  passing  or 
failing  some  phase  of  flight  training  given  that  no  selection  test  battery  is  operational.  By  the  phrase  “no  selection 
test  battery  operating,”  we  mean  any  additional  selection  other  than  what  is  implemented  by  the  current  medical 
standards  and  intelligence  tests.  For  example,  Blower  [6]  has  shown  that  psychomotor  skills,  scores  from  pre-flight 
ground  school,  and  personality  evaluations  represent  cogent  information  that  would  change  the  assigned  probability 
of  success.  So,  the  assignments  to  P(Pass)  and  P(Fail)  do  not  reflect  this  kind  of  extra  information  that  could  be 
used  by  the  PPS.  They  are  simply  probabilities  assigned  to  Pass  and  Fail  that  do  not  take  into  account  scores  from 
a  test  battery,  for  example.  From  the  past  historical  records,  an  assignment  of 

$$P(\text{Pass}) = .75 \quad \text{and} \quad P(\text{Fail}) = .25$$

is a fairly accurate statement through advanced flight training.

The rhs of the revised form of the prediction algorithm is then

$$\beta = \frac{C_2}{C_1}\,\frac{P(\text{Fail})}{P(\text{Pass})} = \frac{3}{1} \times \frac{.25}{.75} = 1.$$


With this particular value of $\beta$ determined, the prediction algorithm looks like

$$\text{If } L(x) > 1 \text{ then Predict Pass, otherwise Predict Fail.}$$
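The arithmetic behind this value of the response threshold can be reproduced directly. In the sketch below the variable names are illustrative, and the attrition cost is read as $C_2$ and the replacement cost as $C_1$, which is the assignment that gives the 3:1 ratio used in the text.

```python
# Costs quoted in the example (dollars).
cost_attrition = 900_000    # read here as C2: a failure during advanced flight training
cost_replacement = 300_000  # read here as C1: replacing a PPS rejected candidate

# Historical probabilities assigned without any selection test battery scores.
p_pass = 0.75
p_fail = 0.25

# Response threshold: cost ratio times the prior odds of failing over passing.
beta = (cost_attrition / cost_replacement) * (p_fail / p_pass)
print(beta)
```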


We now turn to address the lhs of the prediction algorithm. For a new candidate, this is the ratio of the
likelihood of the data from whatever additional information makes up the new selection test battery. The PPS
would calculate $L(x)$ as well as $\beta$ for the new candidate who is to be classified. Specifically, the
likelihood of these data conditioned upon being in the PASS group must be compared to the likelihood of these
data conditioned upon being in the FAIL group.

For  a  simple  first  example  that  seems  to  work  well  in  practice,  consider  that  these  likelihoods  are  well 
described  by  two  Normal  probability  density  functions.  The  shorter  word  curve  will  be  used  as  a  synonym  for 
probability  density  function.  Even  if  the  Normal  curves  are  not  analytically  the  correct  curves,  they  are,  in  many 
cases,  good  approximations  to  the  correct  underlying  curves.  For  example,  this  situation  occurs  when  a  composite 
score  is  constructed  from  many  other  scores  in  the  test  battery.  Using  a  technique  like  Discriminant  Analysis, 
such  a  composite  score  can  be  constructed  such  that  the  means  of  two  curves  are  as  far  apart  as  possible  while,  at 
the  same  time,  the  standard  deviations  are  made  equal.  These  composite  scores  are  also  assumed  to  follow  a 
Normal  curve  for  both  the  PASS  and  FAIL  groups.  After  examining  the  implications  of  the  prediction  algorithm 
for  this  simplification,  the  technical  justification  will  be  provided  in  Appendix  B. 

The discussion about the prediction algorithm is easier to follow when accompanied by sketches of these
Normal curves and the placement of the response threshold. See Fig. 1 for an initial orientation. In this sketch,
there is one Normal curve at the left to represent the predictive distribution of the composite score given that the
candidate is a failure. Displaced to the right is the comparable Normal curve representing the predictive
distribution of the composite score assuming the truth of the candidate passing. The likelihood for each curve is
simply the ordinate, or the value at the y-axis, on the Normal curve.

Figure 1: Normal curves used as approximations for the likelihood of the composite score ($D_{N+1}$) under the PASS
and FAIL assumptions.

The  candidate  participates  in  the  selection  test  battery  and  receives  a  specific  composite  score.  Drawing  a  line 
upwards  at  the  point  on  the  x-axis  representing  this  composite  score  intersects  both  Normal  curves.  For  a  “low” 
composite  score,  the  line  first  intersects  the  Normal  curve  for  the  PASS  group  and  then  intersects  the  Normal  curve 
for  the  FAIL  group.  The  y-axis  values  at  these  two  intersection  points  are  the  likelihoods  we  need  to  form  the 
ratio, L(x), as shown in Fig. 2. The low composite score is given a value of -1.00 in Fig. 2.

For  a  “low”  composite  score,  the  y-axis  value  is  going  to  be  larger  for  the  FAIL  group  than  for  the  PASS 
group.  Therefore,  the  likelihood  ratio,  L(x),  because  it  is  a  ratio  of  the  likelihood  of  the  PASS  group  over  the 
FAIL  group,  must  be  less  than  1.  As  the  composite  score  gets  better  and  better,  it  eventually  reaches  the  point 
where  the  two  curves  intersect.  Here  the  y-axis  values  for  both  curves  are  equal  and  therefore  L(x)  =  1.  We  can 
imagine  this  process  of  obtaining  a  better  composite  score  continuing  so  that  eventually  the  line  first  intersects  the 
FAIL  Normal  curve  followed  by  intersecting  the  PASS  Normal  curve.  Then  the  y-axis  value  for  the  PASS  group  is 
larger  than  for  the  FAIL  group  and  the  likelihood  ratio,  L(x),  becomes  greater  than  1.  When  a  composite  score  is 
high enough to reach this likelihood ratio, then it exceeds the value of the response threshold, $\beta = 1$. At this point,
the  prediction  algorithm  will  output  a  PREDICT  PASS.  If  the  composite  score  obtained  by  the  candidate  is  not 
high  enough  to  reach  this  response  threshold,  then  the  likelihood  ratio  is  less  than  1  and  the  prediction  algorithm 
will  output  a  PREDICT  FAIL.  See  Figure  3  for  an  illustration  of  this  concept. 
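The reading of likelihoods off the two curves can be mimicked in code. The following is a minimal sketch, assuming unit standard deviations and the means of -1.00 (FAIL) and +1.00 (PASS) used in the figures; the function names are illustrative only.

```python
import math


def normal_ordinate(x, mean, sd=1.0):
    """Height of the Normal curve at x (the y-axis value)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (math.sqrt(2.0 * math.pi) * sd)


def likelihood_ratio(x, mean_pass=1.0, mean_fail=-1.0):
    """L(x): ordinate of the PASS curve over the ordinate of the FAIL curve."""
    return normal_ordinate(x, mean_pass) / normal_ordinate(x, mean_fail)


# A low, a middle, and a high composite score.
for score in (-1.0, 0.0, 1.0):
    ratio = likelihood_ratio(score)
    decision = "PREDICT PASS" if ratio > 1.0 else "PREDICT FAIL"
    print(score, ratio, decision)
```

With these settings the ratio is below 1 for the low score, equal to 1 where the curves cross, and above 1 for the high score.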

Finding  the  Threshold  Score 

The  dividing  line  when  L(x)  =  1  in  Fig.  3  where  PREDICT  FAIL  is  distinguished  from  PREDICT  PASS 
occurs  at  the  threshold  score  of  0.  How  do  you  go  about  actually  finding  that  particular  composite  score?  We  will 
answer  that  question  shortly  in  a  long  algebraic  derivation,  but  it  is  easy  to  see  from  Fig.  3  that  x  lies  halfway 
between  the  mean  of  the  FAIL  group  and  the  mean  of  the  PASS  group.  In  Fig.  3,  the  mean  of  the  FAIL  group 
occurs  at  -1.00  and  the  mean  of  the  PASS  group  occurs  at  +1.00,  so  the  threshold  score  is  placed  halfway 
between  them  at  a  composite  score  of  0. 


Figure 2: Normal curves used as approximations for the likelihood of the composite score ($D_{N+1}$) under the PASS
and FAIL assumptions. A likelihood ratio less than 1 and equal to 1 are drawn into the sketch.

[Figure 3 labels: Threshold; Composite Score]

Figure 3: Normal curves used as approximations for the likelihood of the composite score ($D_{N+1}$) under the
assumptions of PASS and FAIL. Three likelihood ratios, the first less than 1, the second equal to 1, and the third
greater than 1, are sketched in the figure. The regions for the composite score to yield a predict pass or a predict
fail for the case where L(x) = 1 are also shown.

We have talked about the ordinate of the Normal curve as the y-axis value. The formula for the ordinate of the
Normal curve is

$$\text{Normal pdf} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\} \tag{23}$$


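We have already discussed the fact that the threshold response for $\beta = 1$ occurs where the two Normal curves
intersect. Therefore, their ordinates must be the same value.

$$\frac{1}{\sqrt{2\pi}\,\sigma_F} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_F}{\sigma_F} \right)^2 \right\} = \frac{1}{\sqrt{2\pi}\,\sigma_P} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_P}{\sigma_P} \right)^2 \right\} \tag{24}$$

Discriminant Analysis returns composite scores with

$$\sigma_F = \sigma_P = 1$$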


permitting the leading term to be canceled. Equation (24) then simplifies to

$$\exp\{-1/2\,(x - \mu_F)^2\} = \exp\{-1/2\,(x - \mu_P)^2\}. \tag{25}$$

Take the natural logarithmic transform of both sides of Equation (25) to yield

$$-1/2\,(x - \mu_F)^2 = -1/2\,(x - \mu_P)^2. \tag{26}$$

Now expand the quadratic on each side of Equation (26)

$$-1/2\,(x^2 - 2x\mu_F + \mu_F^2) = -1/2\,(x^2 - 2x\mu_P + \mu_P^2). \tag{27}$$

Multiply the terms in the parentheses by $-1/2$

$$-1/2\,x^2 + x\mu_F - 1/2\,\mu_F^2 = -1/2\,x^2 + x\mu_P - 1/2\,\mu_P^2. \tag{28}$$

Cancel the leading term

$$x\mu_F - 1/2\,\mu_F^2 = x\mu_P - 1/2\,\mu_P^2. \tag{29}$$

Collect like terms on the appropriate sides of Equation (29)

$$x(\mu_P - \mu_F) = 1/2\,(\mu_P^2 - \mu_F^2). \tag{30}$$

The rhs of Equation (30) can be decomposed into

$$(\mu_P^2 - \mu_F^2) = (\mu_P - \mu_F)(\mu_P + \mu_F) \tag{31}$$

leading to

$$x(\mu_P - \mu_F) = 1/2\,(\mu_P - \mu_F)(\mu_P + \mu_F). \tag{32}$$

The final step is to isolate x and achieve the intuitive answer that, yes, the threshold composite score does lie
midway between the means of the PASS and FAIL groups.

$$x = 1/2\,(\mu_P + \mu_F) \tag{33}$$

Changing the Response Threshold

The response threshold, $\beta$, will change as a function of different costs and different prior probabilities. $\beta$ could
easily change from the value of 1 as used in the numerical example given above. What effect does changing $\beta$
have on the prediction as given by PPS?

In the definition of $\beta$, there are two ratios: 1) the ratio of costs, $C_2/C_1$, and 2) the ratio of prior probabilities,
P(Fail)/P(Pass). Again, the phrase prior probability is used only as a shorthand for a probability that must be
assigned without having the benefit of any selection test battery scores. It will, in fact, be based on other
information and should properly be written as P(Pass|I) to indicate that a probability for passing has been
assigned given the truth of some information I.

We consider each ratio making up $\beta$ in turn, and while changing one ratio, the other ratio is assumed fixed.
As the ratio of costs increases, it becomes relatively more expensive to experience a failure during
flight training than to replace a PPS rejected candidate who would have succeeded. Suppose that $C_2/C_1$ moves
from 3 to 6 to 9 to 12, while P(Fail|I) divided by P(Pass|I) remains fixed at 1/3. Then $\beta$ increases from 1 to 2
to 3 to 4. The likelihood ratio, L(x), must now exceed an increasingly larger number in order to predict a pass.
The y-axis value of the threshold composite score for the PASS group must get larger relative to the y-axis value
of the FAIL group, and this can only be accomplished if the threshold composite score is increasing. It thus
becomes more difficult for a candidate to get selected because of the higher score needed for the PPS to
recommend entry into training. Conversely, if the ratio of costs decreases, then, by the same reasoning, the
composite score needed for the PPS to recommend entry is driven lower.
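The movement of the threshold composite score with the cost ratio can be tabulated. The sketch below rests on an assumption not stated in the report: repeating the midpoint algebra with $\ln \beta$ retained on one side gives, for unit-variance Normal curves, a threshold of $\tfrac{1}{2}(\mu_P + \mu_F) + \ln\beta / (\mu_P - \mu_F)$. The function name and default means are illustrative.

```python
import math


def threshold_score(beta, mean_pass=1.0, mean_fail=-1.0):
    """Composite score at which L(x) equals beta (unit-variance Normal curves).

    Derived for this sketch by setting ln L(x) = ln(beta); not a formula
    quoted from the report.
    """
    return 0.5 * (mean_pass + mean_fail) + math.log(beta) / (mean_pass - mean_fail)


prior_odds_fail = 1.0 / 3.0  # P(Fail|I) / P(Pass|I), held fixed
for cost_ratio in (3, 6, 9, 12):
    beta = cost_ratio * prior_odds_fail
    print(cost_ratio, beta, threshold_score(beta))
```

Under these assumptions the threshold starts at 0 for a cost ratio of 3 and rises steadily as the ratio grows, which is the behavior described in the paragraph above.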

In  the  first  case,  the  objective  is  to  decrease  the  number  of  failures  in  training  at  the  expense  of  increasing  the 
number  of  incorrect  rejections.  In  the  second  case,  the  objective  is  to  decrease  the  number  of  candidates  incorrectly 
rejected at the expense of increased training failures. This creates a continuous trade-off between these two errors
as we change the cost structure in the loss matrix. Any point during this trade-off is justified by the lower expected
loss of taking this action.
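The trade-off between the two kinds of error can be made concrete with the idealized Normal curves. This is a minimal sketch assuming unit standard deviations and means of -1.00 and +1.00. The rates are conditional on the group a candidate truly belongs to, and the function names are illustrative.

```python
import math


def normal_cdf(x, mean, sd=1.0):
    """Cumulative probability of a Normal curve up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def error_rates(threshold, mean_pass=1.0, mean_fail=-1.0):
    """Return (incorrect rejection rate, training failure rate) at a threshold."""
    # Candidates who would pass but score below the threshold are rejected.
    incorrect_rejection = normal_cdf(threshold, mean_pass)
    # Candidates who would fail but score above the threshold are selected.
    training_failure = 1.0 - normal_cdf(threshold, mean_fail)
    return incorrect_rejection, training_failure


for threshold in (-0.5, 0.0, 0.5):
    rejected, failed = error_rates(threshold)
    print(f"threshold {threshold:+.1f}: rejection rate {rejected:.3f}, failure rate {failed:.3f}")
```

Raising the threshold lowers the failure rate among selected would-be failures while raising the rejection rate among would-be passes, and lowering it does the reverse.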

Due to the multiplicative nature of $\beta$, an entirely analogous event takes place if we keep the cost structure fixed
and manipulate the ratio of prior probabilities. If, independently of any selection test battery scores, the probability
of passing is very high, say P(Pass|I) = .90, then the second ratio becomes 1/9. $\beta$ then moves lower in
comparison with the situation in the example just discussed where P(Pass|I) = .75 (assuming that the costs
remain fixed). A lower composite score than before can result in a predicted pass. In a sense, the test battery
becomes less relevant when the probability of passing without the test scores is already very high. On the other
hand, when the probability of passing prior to the implementation of the selection test battery is low, the threshold
composite score is raised. Achieving a good score on the selection test battery then becomes more important.
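As a worked illustration of the prior-probability effect, keep $C_2/C_1 = 3$ and the unit-variance Normal curves with means at -1.00 and +1.00. The closed-form threshold $x^{*} = \tfrac{1}{2}(\mu_P + \mu_F) + \ln\beta/(\mu_P - \mu_F)$ used here is an assumption of this sketch, obtained by repeating the midpoint algebra with $\ln\beta$ retained; it is not stated in the report.

```latex
\begin{align*}
P(\text{Pass}\mid I) = .75: &\quad \beta = 3 \times \frac{.25}{.75} = 1,
  & x^{*} &= \frac{\ln 1}{2} = 0, \\
P(\text{Pass}\mid I) = .90: &\quad \beta = 3 \times \frac{.10}{.90} = \frac{1}{3},
  & x^{*} &= \frac{\ln(1/3)}{2} \approx -0.55 .
\end{align*}
```

Under these assumptions, raising the prior probability of passing from .75 to .90 lowers the threshold composite score by roughly half a standard deviation.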


Appendix A

Formal Derivation of the Bayesian Predictive Density

This appendix contains a formal proof of the Bayesian predictive density function. The derivation starts out
with  a  generic  notation  to  make  it  easier  to  follow  each  individual  step  in  the  proof,  and  then  switches  at  the  end  to 
the  notation  that  we  have  used  throughout  the  paper.  The  proof  depends  solely  upon  the  product  rule  and  the  sum 
rule  from  probability  theory  and  Bayes’s  theorem,  which  itself  is,  in  turn,  derived  from  the  sum  and  product  rules. 

As  a  mental  pump  priming  for  the  main  proof,  we  first  present  a  simpler  proof  which  has  all  the  necessary 
elements,  but  in  a  more  digestible  form.  Consider  the  joint  probability  of  two  propositions  A  and  B  written  as 
P(A, B). Proposition B can be broken down into K mutually exclusive and exhaustive subpropositions
$B_1, B_2, \ldots, B_j, \ldots, B_K$. By the sum rule,

$$P(A) = \sum_{j=1}^{K} P(A, B_j). \tag{A1}$$

By the product rule, the joint probability in the rhs of Equation (A1) can be written as

$$P(A, B_j) = P(A \mid B_j)\,P(B_j) \tag{A2}$$

so we have from Equations (A1) and (A2),

$$P(A) = \sum_{j=1}^{K} P(A \mid B_j)\,P(B_j). \tag{A3}$$

Equation (A3) becomes more transparent when proposition A stands for the data D and subpropositions
$B_1, B_2, \ldots, B_j, \ldots, B_K$ for the K hypotheses. Then Equation (A3) reads

$$P(D) = \sum_{j=1}^{K} P(D \mid H_j)\,P(H_j) \tag{A4}$$

which is the denominator in Bayes’s Theorem and explains the transition in the denominator from Equation (14) to
Equation (15) in the main text.

The  proof  for  the  predictive  density  follows  along  the  same  lines,  although  it  is  a  little  trickier  because  of  the 
many  manipulations  needed  to  get  all  the  symbols  in  the  right  place.  Instead  of  just  two  propositions  A  and  B  in 
the  example  above,  we  now  have  four  propositions.  I  have  found  it  easier  to  derive  the  predictive  equation  by  first 
using a generic notation, w, x, y, and z, for the four propositions, and then substituting the notation we have used
throughout  the  paper  at  the  final  step. 

To  begin,  write  out  the  joint  probability  for  all  four  propositions  under  consideration,  P(w,  x,  y,  z),  just  as  we 
did  above.  Again,  following  the  same  pattern  as  above,  use  the  product  rule  to  form 

$$P(w, x, y, z) = P(w \mid x, y, z)\,P(x \mid y, z)\,P(y \mid z)\,P(z). \tag{A5}$$

Divide both sides by P(z),

$$\frac{P(w, x, y, z)}{P(z)} = P(w \mid x, y, z)\,P(x \mid y, z)\,P(y \mid z). \tag{A6}$$

By the product rule, the lhs of Equation (A6) is also equal to

$$\frac{P(w, x, y, z)}{P(z)} = P(w, x, y \mid z). \tag{A7}$$

Therefore, equating the rhs of Equations (A6) and (A7) yields

$$P(w, x, y \mid z) = P(w \mid x, y, z)\,P(x \mid y, z)\,P(y \mid z). \tag{A8}$$

Divide both sides of Equation (A8) by $P(y \mid z)$,

$$\frac{P(w, x, y \mid z)}{P(y \mid z)} = P(w \mid x, y, z)\,P(x \mid y, z). \tag{A9}$$

Now take the rhs of Equation (A9) and use the product rule in reverse,

$$P(w \mid x, y, z)\,P(x \mid y, z) = P(w, x \mid y, z). \tag{A10}$$

The lhs of Equation (A9) is equal to the rhs of Equation (A10). Therefore,

$$\frac{P(w, x, y \mid z)}{P(y \mid z)} = P(w, x \mid y, z). \tag{A11}$$


Use the sum rule on the rhs of Equation (A11)

$$\int P(w, x \mid y, z)\,dx = P(w \mid y, z). \tag{A12}$$

In the first derivation that began this appendix, the K discrete values of B were summed over in Equation (A1) to
find P(A). This operation is known in the Bayesian jargon as “eliminating a nuisance parameter.” In Equation (A12),
the nuisance parameter x was eliminated. Because this parameter will turn out to be continuous, perform an
integration instead of the summation. Substitute the lhs of Equation (A10) for the integrand in the lhs of Equation
(A12),

$$P(w \mid y, z) = \int P(w \mid x, y, z)\,P(x \mid y, z)\,dx. \tag{A13}$$

For the final step, make w independent of y once x is given and, as in the main text, suppress y on the lhs.

$$P(w \mid z) = \int P(w \mid x, z)\,P(x \mid y, z)\,dx. \tag{A14}$$


This  completes  the  proof  for  the  Bayesian  predictive  density  where  w  is  the  new  data  and  z  is  the  state  of  the 
world  for  which  this  new  data  is  assumed  true,  x  is  the  parameter,  or  set  of  parameters,  in  the  model  which 
describe  how  the  data  are  generated,  and  y  is  all  the  past  data.  Explicitly  matching  up  this  generic  notation  with 
the  notation  of  the  particular  problem  being  treated  in  this  paper,  we  have 

$$w = D_{N+1}$$

$$x = \lambda$$

$$y = D_N$$

$$z = \text{Pass or Fail}$$

$D_{N+1}$ stands for the data from the selection test for a particular candidate whose training outcome we wish to
predict, $D_N$ stands for the data from the selection test and known training outcomes for the validation sample
consisting of the previous N subjects, and $\lambda$ stands for the set of parameters in the model which tell us how the
data came about. The conditioning information z can take on only two values, either PASS or FAIL. If z = PASS,
then Equation (A14) looks like

$$P(D_{N+1} \mid \text{Pass}) = \int P(D_{N+1} \mid \lambda, \text{Pass})\,P(\lambda \mid D_N, \text{Pass})\,d\lambda \tag{A15}$$

in the standard notation used throughout the main text. Similarly, if z = FAIL, then Equation (A14) looks like

$$P(D_{N+1} \mid \text{Fail}) = \int P(D_{N+1} \mid \lambda, \text{Fail})\,P(\lambda \mid D_N, \text{Fail})\,d\lambda. \tag{A16}$$

The likelihood ratio, $L(x)$, demands that we form two predictive densities, one conditioned on the PASS group and
one conditioned on the FAIL group.

This is what we set out to accomplish. In words, the final result of Equations (A15) or (A16) says to first set
up the likelihood of the data from the candidate being tested as conditioned on the set of parameters, $P(D_{N+1} \mid \lambda)$,
and then multiply by the posterior probability of the parameters as conditioned on all of the previous data,
$P(\lambda \mid D_N)$. This multiplication is done over the entire range that the parameters can assume, and then the results
are summed. In other words, Equations (A15) and (A16) are asking for the average likelihood of $D_{N+1}$ with the
weighting function being the posterior probability of the parameters. In any case, the result is the Bayesian
predictive density of the score, or scores, from the test battery for the candidate currently undergoing selection.
Technically, Equations (A15) and (A16) should replace the curves shown in Figs. 1-3, although, as shown in
Appendix B, the rigorous application of Equations (A15) and (A16) for one common case results in curves that are
very similar to the idealized Normal curves already studied.

If  these  predictive  densities  turn  out  to  be  too  difficult  to  solve  analytically,  we  can  still  obtain  an  answer 
through  a  numerical  approximation  to  the  integral.  An  example  of  a  computer  program  to  perform  such  a 
numerical  approximation  to  the  predictive  density  can  be  found  in  Blower  [7].  The  essence  of  the  computer 
program  is  to  calculate  an  average  as  just  described.  The  likelihood  of  the  composite  score  for  the  new  candidate 
is  calculated  over  many  values  of  the  parameters.  If  the  number  of  parameters  is  small,  as  in  the  Normal  case 
when there are only two parameters, $\lambda = \{\mu, \sigma\}$, then a grid can be constructed over a reasonable range of the
parameters.  At  each  grid  value,  the  posterior  density  function  of  the  parameters  is  calculated.  These  values  serve 
as  the  weights  that  form  the  average  of  the  likelihood.  After  each  likelihood  is  multiplied  by  its  respective  weight 
over  many  values  of  the  parameters,  this  sum  is  divided  by  the  sum  of  the  weights.  The  resulting  value  is  a  good 
estimate  of  the  integral  in  Equations  (A15)  and  (A16). 
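The weighted-average recipe just described can be sketched as follows. This is not the program from Blower [7]: the function names, the grid ranges, the flat prior over the grid, and the sample scores are all assumptions made for illustration.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Natural log of the Normal ordinate at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def make_grid(low, high, steps):
    """Evenly spaced values from low to high inclusive."""
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def predictive_density(new_score, past_scores, mu_grid, sigma_grid):
    """Posterior-weighted average of the likelihood of new_score.

    Assumes a flat prior over the grid, so each weight is proportional to
    the likelihood of the past scores at that (mu, sigma) grid point.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for mu in mu_grid:
        for sigma in sigma_grid:
            # Sum logs first, then exponentiate, to limit underflow.
            log_weight = sum(log_normal_pdf(d, mu, sigma) for d in past_scores)
            weight = math.exp(log_weight)
            likelihood = math.exp(log_normal_pdf(new_score, mu, sigma))
            weighted_sum += weight * likelihood
            weight_total += weight
    # Divide the weighted sum by the sum of the weights.
    return weighted_sum / weight_total


# Hypothetical composite scores for a small PASS validation sample.
pass_scores = [0.2, 1.4, 0.9, 1.8, 0.6, 1.1, 1.3, 0.4]
mu_values = make_grid(-2.0, 4.0, 61)
sigma_values = make_grid(0.2, 3.0, 57)

for score in (0.0, 1.0, 3.0):
    print(score, predictive_density(score, pass_scores, mu_values, sigma_values))
```

Running the same function on a FAIL sample and dividing the two results gives a numerical likelihood ratio for the new candidate.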

After  obtaining  the  predictive  densities,  the  prediction  algorithm  can  be  written  in  its  most  general  form. 
Because  the  likelihood  ratio  is  equal  to 

$$L(x) = \frac{P(D_{N+1} \mid \text{Pass})}{P(D_{N+1} \mid \text{Fail})},$$

Equation (21) in the main part of the paper can be reexpressed as

$$\frac{\int_R P(D_{N+1} \mid \lambda, \text{Pass})\,P(\lambda \mid D_N, \text{Pass})\,d\lambda}{\int_R P(D_{N+1} \mid \lambda, \text{Fail})\,P(\lambda \mid D_N, \text{Fail})\,d\lambda} > \frac{C_2}{C_1} \times \frac{P(\text{Fail})}{P(\text{Pass})}. \tag{A17}$$

Appendix B

An Analytical Solution for a Bayesian Predictive Density
that Justifies the Use of the Normal Curve

This appendix shows how to derive an analytical solution for either one of the integrals on the lhs of Equation
(A17).  The  solution  is  for  the  circumstances  treated  in  the  text  as  shown  in  Figs.  1-3  where  we  assumed  Normal 
distributions  for  the  predictive  densities.  The  derivation  of  a  predictive  density  like  this  is  well  known  in  the 
Bayesian literature. We follow closely the account given by Zellner [8].

Assume that both the data $D_N$ from the previous N candidates and the data $D_{N+1}$ provided by a new
candidate  are  discriminant  scores  arising  from  the  use  of  a  discriminant  analysis.  Each  candidate  receives  one 
discriminant  score  that  is  a  linear  weighting  of  the  many  test  scores  comprising  the  test  battery.  The  weights  are 
determined  by  the  discriminant  analysis.  Such  summary  scores  are  usually  called  composite  scores  in  the  PPS. 
Discriminant  analysis  constructs  these  composite  scores  such  that  they  are  distributed  according  to  a  Normal  pdf 
with  a  standard  deviation  of  1.  The  mean  of  the  PASS  group  and  the  mean  of  the  FAIL  group  are  separated  as 
much  as  possible  given  the  data  from  the  training  outcomes. 

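If the composite score for the $i$th candidate is labeled as $d_i$, then the Normal pdf is

$$\text{Normal pdf} = \frac{1}{\sqrt{2\pi}\,\sigma_p} \exp\left\{ -\frac{1}{2} \left( \frac{d_i - \mu_p}{\sigma_p} \right)^2 \right\} \tag{B1}$$

just as in Equation (23). By construction, $\sigma_p = 1$, so the pdf can be shortened to

$$\text{Normal pdf} = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} (d_i - \mu_p)^2 \right\}. \tag{B2}$$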


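The validation sample consists of $N_p$ candidates with composite scores $d_1, d_2, \ldots, d_{N_p}$. These scores come from
only those candidates who passed, as the subscript p indicates. Assuming independence of scores from the $N_p$
different subjects,

$$\begin{aligned}
P(d_1, d_2, \ldots, d_{N_p} \mid \mu_p, \sigma_p = 1) &= \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}(d_1 - \mu_p)^2 \right\} \times \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}(d_2 - \mu_p)^2 \right\} \times \cdots \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{N_p} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{N_p} (d_i - \mu_p)^2 \right\}.
\end{aligned} \tag{B3}$$

The term

$$\left( \frac{1}{\sqrt{2\pi}} \right)^{N_p}$$

in Equation (B3) is a constant value for the fixed value of $N_p$ and thus can be absorbed into a proportionality
factor. Therefore, the likelihood is proportional to the exponential factor,

$$P(d_1, d_2, \ldots, d_{N_p} \mid \mu_p, \sigma_p = 1) \propto \exp\left\{ -\frac{1}{2} \sum_{i=1}^{N_p} (d_i - \mu_p)^2 \right\}. \tag{B4}$$

We now show that

$$\sum_{i=1}^{N_p} (d_i - \mu_p)^2 = N_p (\mu_p - \bar{d})^2 + \text{Constant}$$

where $\bar{d}$ is the sample average composite score over the $N_p$ candidates who passed.

$$\begin{aligned}
\sum_{i=1}^{N_p} (d_i - \mu_p)^2 &= \sum \left[ (d_i - \bar{d}) - (\mu_p - \bar{d}) \right]^2 \\
&= \sum \left[ (d_i - \bar{d})^2 - 2 (d_i - \bar{d})(\mu_p - \bar{d}) + (\mu_p - \bar{d})^2 \right] \\
&= \sum (d_i - \bar{d})^2 - 2 \sum (d_i - \bar{d})(\mu_p - \bar{d}) + \sum (\mu_p - \bar{d})^2 \\
&= \sum (d_i - \bar{d})^2 + \sum (\mu_p - \bar{d})^2 \\
&= \sum (d_i - \bar{d})^2 + N_p (\mu_p - \bar{d})^2 \\
&= N_p (\mu_p - \bar{d})^2 + \text{Constant}.
\end{aligned}$$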

The first line subtracts a constant, the sample mean of the composite scores, $\bar{d}$, and then adds the same constant to
the expression. Then in the second line, the expression in brackets is expanded by squaring. In the third line, the
summation sign is distributed across the three terms in the expression. The middle term works out to be zero
because

$$\sum (d_i - \bar{d}) = \sum d_i - N_p \bar{d} = N_p \times \left[ \frac{\sum d_i}{N_p} - \frac{N_p \bar{d}}{N_p} \right] = N_p \times (\bar{d} - \bar{d}) = 0$$

which results in the fourth line. The first term is again a constant for any fixed set of data, being the sum of sample
squared deviations. It can also be absorbed into the proportionality factor. So Equation (B4) can now be written as

$$P(d_1, d_2, \ldots, d_{N_p} \mid \mu_p, \sigma_p = 1) \propto \exp\left\{ -\frac{1}{2} N_p (\mu_p - \bar{d})^2 \right\}. \tag{B5}$$
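The sum-of-squares identity that leads to Equation (B5) is easy to confirm numerically. The scores below are hypothetical and the helper name is illustrative.

```python
def sum_sq_about(center, scores):
    """Sum of squared deviations of the scores about a given center."""
    return sum((d - center) ** 2 for d in scores)


# Hypothetical composite scores for candidates who passed.
scores = [0.2, 1.4, 0.9, 1.8, 0.6]
n_p = len(scores)
d_bar = sum(scores) / n_p

# The "Constant": squared deviations about the sample mean, free of mu_p.
constant = sum_sq_about(d_bar, scores)

for mu_p in (-1.0, 0.0, 0.5, 2.0):
    lhs = sum_sq_about(mu_p, scores)
    rhs = n_p * (mu_p - d_bar) ** 2 + constant
    print(mu_p, lhs, rhs)
```

For every trial value of the mean, the two columns agree up to floating-point rounding, which is why only the term in $N_p(\mu_p - \bar{d})^2$ matters for the shape of the likelihood.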

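By Bayes’s Theorem,

$$P(\mu_p, \sigma_p \mid D_N) \propto P(D_N \mid \mu_p, \sigma_p)\,P(\mu_p, \sigma_p).$$

Since $\sigma_p = 1$, the prior probability for the joint occurrence of the two parameters is usually assigned as a constant
k in the Bayesian approach. The same range will be assigned for the joint occurrence of $(\mu_f, \sigma_f)$. Therefore, the
posterior probability of the parameters is

$$\begin{aligned}
P(\mu_p, \sigma_p \mid D_N) &\propto P(D_N \mid \mu_p, \sigma_p)\,P(\mu_p, \sigma_p) \\
&\propto P(D_N \mid \mu_p, \sigma_p) \times k \\
&\propto \exp\left\{ -\frac{1}{2} N_p (\mu_p - \bar{d})^2 \right\}.
\end{aligned}$$

Solving the formula for the predictive density given the PASS group

$$\int_R P(D_{N+1} \mid \lambda, \text{Pass})\,P(\lambda \mid D_N, \text{Pass})\,d\lambda$$

is our eventual goal. We have just found one of those two terms, the posterior probability of the parameters, as

$$P(\lambda \mid D_N, \text{Pass}) \propto \exp\left\{ -\frac{1}{2} N_p (\mu_p - \bar{d})^2 \right\}$$

where

$$\lambda = \{\mu_p, \sigma_p\}.$$

From the same assumptions that started out this appendix, the likelihood of the new composite score, $D_{N+1}$,
obtained from the candidate currently being tested, is

$$P(D_{N+1} \mid \mu_p, \sigma_p, \text{Pass}) \propto \exp\left\{ -\frac{1}{2} (D_{N+1} - \mu_p)^2 \right\}.$$

Therefore,

$$\begin{aligned}
\int P(D_{N+1} \mid \mu_p, \sigma_p)\,P(\mu_p, \sigma_p \mid D_N)\,d\mu\,d\sigma &\propto \int \exp\left\{ -\frac{1}{2} (D_{N+1} - \mu_p)^2 \right\} \times \exp\left\{ -\frac{1}{2} N_p (\mu_p - \bar{d})^2 \right\} d\mu\,d\sigma \\
&= \int \exp\left\{ -\frac{1}{2} \left[ (D_{N+1} - \mu_p)^2 + N_p (\mu_p - \bar{d})^2 \right] \right\} d\mu\,d\sigma.
\end{aligned} \tag{B6}$$

An analytical solution to this integral was worked out by Zellner [8]. In our notation, this solution is

$$P(D_{N+1} \mid \text{Pass}) = \frac{1}{\sqrt{2\pi (1 + 1/N_p)}} \exp\left\{ -\frac{(D_{N+1} - \bar{d})^2}{2 (1 + 1/N_p)} \right\}. \tag{B7}$$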


This is a Normal distribution for the composite score, $D_{N+1}$, as obtained from the new candidate we wish to
classify correctly. It is centered on the sample mean of the composite scores, $\bar{d}$, with a variance inflated by a factor
related to the sample size of the PASS group, $1 + 1/N_p$. As you can see, when the sample sizes $N_p$ and $N_f$ become
large, the predictive density functions are essentially the same as shown earlier in Figs. 1-3. Those Normal curves
were the same as Equation (B7) when $N \to \infty$. It is only when the sample sizes are small that Equation (B7) will
provide the necessary correction for the likelihood ratio. Even in this case, the correction is much smaller than
might have been expected.
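The closed form in Equation (B7) can be checked against a direct numerical average over the mean. This is a minimal sketch with illustrative function names and grid settings. Because $\sigma_p$ is fixed at 1, only $\mu_p$ is integrated over, and the weights are the unnormalized posterior $\exp\{-\tfrac{1}{2} N_p (\mu_p - \bar{d})^2\}$.

```python
import math


def predictive_closed_form(score, d_bar, n):
    """Equation (B7): Normal density centered on d_bar with variance 1 + 1/n."""
    variance = 1.0 + 1.0 / n
    return math.exp(-(score - d_bar) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def predictive_by_grid(score, d_bar, n, half_width=8.0, steps=4001):
    """Posterior-weighted average of the likelihood over a grid of mu values."""
    low = d_bar - half_width
    step = 2.0 * half_width / (steps - 1)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(steps):
        mu = low + i * step
        # Unnormalized posterior of mu given the past scores.
        weight = math.exp(-0.5 * n * (mu - d_bar) ** 2)
        # Likelihood of the new score for this mu, with sigma fixed at 1.
        likelihood = math.exp(-0.5 * (score - mu) ** 2) / math.sqrt(2.0 * math.pi)
        weighted_sum += weight * likelihood
        weight_total += weight
    return weighted_sum / weight_total


for score in (0.0, 1.0, 2.5):
    print(score, predictive_closed_form(score, 1.0, 30), predictive_by_grid(score, 1.0, 30))
```

With a fine enough grid the two columns agree closely, and as the sample size grows both approach the idealized unit-variance Normal curve.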

In the earlier example, when the sample mean for the composite scores of the FAIL group occurred at -1.00
and the sample mean for the composite scores of the PASS group occurred at +1.00, the threshold score was
placed at 0.00 for $\beta = 1$. Suppose that each sample mean was based on a small number of candidates. Let
$N_p = 30$ and $N_f = 15$. Now we want to find that value of $D_{N+1}$ such that

$$L(x) = \frac{P(D_{N+1} = ? \mid \text{Pass})}{P(D_{N+1} = ? \mid \text{Fail})} = 1$$

or, equivalently, where

$$P(D_{N+1} = ? \mid \text{Pass}) = P(D_{N+1} = ? \mid \text{Fail}).$$

Using Equation (B7) and the analogous equation based on the FAIL group, we find that equality holds between the
two predictive functions if $D_{N+1} = -.00022$. Therefore, there is only a very slight adjustment to the threshold
score from its former value of 0 to accommodate the small sample sizes.

If $\beta$ changes so that the threshold score is further out on the predictive density functions, then the change in the
threshold score will be a little larger for small sample sizes but still not very dramatic. For example, if $\beta = 3$, the
threshold score is placed at +0.55 for the idealized case but only moves up to +0.58 when the correct formula
taking account of the small sample sizes is used. Therefore, a new candidate would have to obtain a composite
score $D_{N+1} > +0.58$ in order to be a predicted pass.
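The small-sample threshold scores can be located with a simple root search. The sketch below uses bisection on the log likelihood ratio built from Equation (B7) and its FAIL analogue, with sample means of +1.00 and -1.00, $N_p = 30$, and $N_f = 15$. The function names and the search interval are assumptions of the sketch, and since the report does not show its exact computation, agreement in the last digits with the quoted values is not guaranteed.

```python
import math


def log_predictive(score, d_bar, n):
    """Log of the predictive density of Equation (B7) for one group."""
    variance = 1.0 + 1.0 / n
    return -0.5 * math.log(2.0 * math.pi * variance) - (score - d_bar) ** 2 / (2.0 * variance)


def log_likelihood_ratio(score, mean_pass=1.0, n_pass=30, mean_fail=-1.0, n_fail=15):
    """ln L(x): PASS predictive density over FAIL predictive density."""
    return log_predictive(score, mean_pass, n_pass) - log_predictive(score, mean_fail, n_fail)


def threshold_score(beta, low=-1.0, high=1.0, tolerance=1e-10):
    """Bisection for the score where L(x) = beta.

    Assumes ln L(x) is increasing on [low, high] and that the root lies inside.
    """
    target = math.log(beta)
    while high - low > tolerance:
        middle = 0.5 * (low + high)
        if log_likelihood_ratio(middle) < target:
            low = middle
        else:
            high = middle
    return 0.5 * (low + high)


for beta in (1.0, 3.0):
    print(beta, threshold_score(beta))
```

The search returns a threshold very close to 0 for a response threshold of 1 and close to +0.58 for a response threshold of 3, in line with the small adjustments described above.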

The  moral  of  this  appendix  is  that  for  composite  scores  based  on  any  reasonable  sample  size,  the  predictive 
density  for  each  group  can  be  taken  as  the  Normal  distribution  centered  at  the  sample  means  of  each  group  with  a 
standard  deviation  of  1.00.  The  kind  of  thorough  Bayesian  analysis  derived  in  this  appendix  justifies  the  standard 
practice  illustrated  in  the  final  section  of  the  main  part  of  the  paper. 

If  you  are  not  willing  to  accept  a  standard  deviation  of  1.00  for  the  composite  scores,  Blower  [7]  has  shown 
how  the  same  Bayesian  approach  explained  here  can  be  used  to  take  account  of  the  uncertainty  in  both  the  sample 
means  and  sample  standard  deviations.  As  you  might  expect,  the  threshold  score  undergoes  a  greater  adjustment  in 
this  case  but  still  doesn’t  move  very  far  from  the  idealized  approximation  presented  earlier.  In  any  case,  one  can 
compute  a  numerical  approximation  for  the  integrals  in  Equation  (A17)  for  any  situation  where  an  analytical 
solution does not exist or is too hard to find. The likelihood ratio, $L(x)$, can be found by forming the ratio of two
such  numerical  approximations,  and  the  prediction  algorithm  will  continue  to  function  under  any  circumstances. 

Of  course,  since  the  algorithm  is  outputting  a  binary  decision,  only  likelihood  ratios  very  close  to  the  threshold 
score  would  have  to  be  calculated  very  exactly  anyway.  Or,  stating  it  in  another  way  more  in  line  with  the  actual 
implementation,  only  composite  scores  very  near  to  the  threshold  score  would  be  affected  in  terms  of  a  predicted 
pass or predicted fail.